Now that we have our cleaned data in a tidy format, let’s do some analyses. First, here are a few more simple examples of chaining code to select, filter, and arrange our data to obtain different subsets.
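For instance, a minimal chain on a small made-up dataset (the `toy` tibble below is hypothetical, just for illustration; the verbs work the same way on our fisheries data) might look like:

```r
library(dplyr)

# A toy catch dataset, invented for illustration only
toy <- tibble(
  country = c('USA', 'USA', 'Peru', 'Peru', 'Japan'),
  year    = c(2000, 2001, 2000, 2001, 2000),
  catch   = c(120, 135, 800, 760, 90)
)

toy %>%
  select(country, year, catch) %>% # keep only the columns we need
  filter(catch > 100) %>%          # drop the small catches
  arrange(desc(catch))             # largest catches first
```

Each verb takes a data frame and returns a data frame, which is what makes the pipeline readable top to bottom.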
Using dplyr with broom and ggplot2
One of the best aspects of working with tidy data and dplyr is how easy they make it to quickly manipulate and plot your data. Properly organized, it’s a piece of cake to make summaries and plots of your data without creating all kinds of “temporary” files or lines of spaghetti plotting code. You can also basically eliminate loops from your code in all situations except those that require dynamic updating (e.g. population models).
For this next exercise, we’re going to use tidyr, dplyr, broom, and ggplot2 to fit a model, run diagnostics, and plot results.
It’s 3am. You’ve been chasing the same cryptic error message for two days (Error: towel not found, don't panic!). You decide enough is enough: you’re going to pack it in, buy a boat, and become a fisherman. The only problem is, years of coding have left you with no knowledge of the outside world besides what R and data can tell you. How are you supposed to know what to fish for, or where to fish? Luckily, you have some data, so you turn to your laptop one last time before hurling it off a cliff in a ritualistic sacrifice to the sea gods.
You want to find a fishery to join based on two criteria: high average catch, and low average variability. You might not know these quantities for a given fishery in advance, though, so you want to be able to predict which fishery to join based on geographic and life history traits.
Our first goals:
Generate a unique ID for each fishery
Calculate the mean log lifetime catch of each fishery
Calculate the coefficient of variation of each fishery
Filter out fisheries with short time series
# Prep our data
dat <- d %>%
  ungroup() %>% # often a good idea to ungroup before starting something new
  mutate(id = paste(country, spcode, regionfao, sep = '_')) %>% # generate a unique ID for each fishery
  group_by(id) %>%
  mutate(mean_log_catch = mean(logcatch, na.rm = TRUE),
         cv_log_catch = sd(logcatch, na.rm = TRUE) / mean(logcatch, na.rm = TRUE),
         length_catch = sum(!is.na(logcatch) & logcatch > 0)) %>% # we want to keep some of the other data as well
  filter(year == max(year), length_catch > 10,
         is.finite(mean_log_catch), cv_log_catch > 0) %>% # we don't want repeated entries, so just keep the last year
  dplyr::select(-year, -catch, -logcatch)
# Always plot!
ggplot(dat, aes(mean_log_catch, cv_log_catch)) +
  geom_point()

OK, we see we’re onto something here: there’s clearly a relationship between average catch and the CV of the catch. We want to build a model that predicts that relationship. Let’s create a composite score from the mean log catch and the inverse of the CV. We’re going to scale the log catches by the maximum log catch, and the CV by the maximum of 1/CV. We also want to add in our nice life history data.
regdat <- dat %>%
  ungroup() %>% # we want global statistics now
  mutate(scaled_ml_catch = mean_log_catch / max(mean_log_catch),
         scaled_cv_catch = (cv_log_catch / min(cv_log_catch))^-1,
         fishiness = scaled_ml_catch + scaled_cv_catch) %>%
  left_join(lh, by = 'sciname')
regplot <- regdat %>% # a great thing about ggplot is that you can save a plot as an object to reuse and modify later
  ggplot(aes(mean_log_catch, cv_log_catch, fill = fishiness)) +
  geom_point(shape = 21) +
  scale_fill_gradient(low = 'red', high = 'green')
regplot

Now we’re getting somewhere! Next, let’s run a regression using life history and geographic variables to try to predict the quality of fishing.
reg_vars <- c('regionfao', 'spgroupname', 'vbk', 'maxl', 'temp') # specify the variables we want
class(regdat$regionfao) # whoops, it thinks FAO region is an integer; we want a factor
filtered_dat <- regdat %>%
  ungroup() %>%
  mutate(has_all = apply(!is.na(regdat[, reg_vars]), 1, all)) %>% # TRUE when no regression variable is missing
  filter(has_all) %>%
  mutate(regionfao = as.factor(regionfao),
         spgroupname = as.factor(spgroupname))
reg_fmla <- as.formula(paste('fishiness ~', paste(reg_vars, collapse = ' + '))) # create the regression formula
fish_model <- lm(reg_fmla, data = filtered_dat) # run a linear regression
summary(fish_model)
Now we’ve got a model! We’re close to being able to use data to predict where we’ll start our fishing operation. But while we know nothing about fishing, we are good statisticians, and we know we should examine our regression before using it to make a big life decision. This is where broom comes in. R has all kinds of great functions, like summary(), for looking at regressions, but they can be a little ad hoc and difficult to manipulate. broom helps us tidy up our regression output. First, suppose we want a better way to look at summary statistics from the regression. The glance() function from the broom package extracts important summary statistics from the model, like the R2, the AIC, and the BIC.
library(broom)
reg_summary <- glance(fish_model)
reg_summary
Unfortunately, our model is pretty poor; it only explains ~20% of the variation in the fishiness variable, but hopefully it’s better than guessing. Let’s dig into this model a bit more. We’re going to use the tidy() function from the broom package to provide neat summaries of the model coefficients.
tidy_model <- tidy(fish_model)
tidy_model$variable <- as.factor(tidy_model$term) # convert terms to factors
tidy_model$variable <- reorder(tidy_model$variable, tidy_model$p.value) # sort variables by p-value
tidy_model$short_pval <- pmin(tidy_model$p.value, 0.2) # cap p-values at 0.2 for plotting
regression_plot <- ggplot(tidy_model, aes(x = variable, y = estimate, fill = short_pval)) +
  geom_bar(position = 'dodge', stat = 'identity', color = 'black') +
  scale_fill_gradient2(high = 'black', mid = 'gray99', low = 'red', midpoint = 0.1,
                       breaks = c(0.05, 0.1, 0.15, 0.2),
                       labels = c('0.05', '0.10', '0.15', '>0.20'),
                       name = 'P-Value', guide = guide_colorbar(reverse = TRUE)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.9, vjust = 0.9)) +
  geom_errorbar(aes(ymin = estimate - 1.96 * std.error, ymax = estimate + 1.96 * std.error)) +
  xlab('Variable') +
  ylab(paste('Marginal Effect on ', names(fish_model$model)[1], sep = '')) +
  coord_flip()
regression_plot

So we can now see that most of the significant terms are region-specific, and the life history data don’t give us a whole lot of information on where we should start fishing. So far, the model says: go fish in China, and maybe avoid salmons, halibuts, and tunas.
Before we charge off and use these results to decide where we’re starting our new life, though, we’re going to use the augment() function in the broom package to help us run some diagnostics on the regression. The augment() function takes the original data passed to the regression and adds all kinds of things, like the values predicted by the model and the residuals. This makes it very useful for regression diagnostics. First off, we might want to check whether our errors are actually normally distributed.
auged_reg <- augment(fish_model)
obs_v_pred <- auged_reg %>%
  ggplot(aes(fishiness, .fitted)) +
  geom_point(shape = 21, size = 4, alpha = 0.6, fill = 'steelblue4') +
  geom_abline(aes(slope = 1, intercept = 0)) +
  xlab('Observed') +
  ylab('Predicted') +
  geom_label(aes(0.25, 0.7), label = paste('R2 = ', round(reg_summary$r.squared, 2), sep = ''))
obs_v_pred

qq_plot <- auged_reg %>% # create a quantile-quantile plot of the residuals
  ggplot(aes(sample = .resid)) +
  stat_qq(shape = 21, size = 4, alpha = 0.6, fill = 'steelblue4') +
  xlab('Theoretical') +
  ylab('Sample')
qq_plot

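Besides eyeballing the QQ plot, a formal normality test is an option (a sketch, not part of the original analysis; in our case we would pass auged_reg$.resid):

```r
# Complement the QQ plot with a formal normality test.
# shapiro.test() accepts samples of 3 to 5000 values; fake_resids is a
# stand-in for our actual residuals, used here for illustration only.
set.seed(42)
fake_resids <- rnorm(200)
shapiro.test(fake_resids) # a large p-value means no evidence against normality
```

A formal test and a QQ plot together are more convincing than either alone, since tests can miss visually obvious tail problems in small samples.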
We see that our residuals are in fact approximately normally distributed; that’s good! Let’s check for heteroskedasticity and model misspecification.
hetsk_plot <- auged_reg %>% # plot fitted values vs. residuals
  ggplot(aes(.fitted, .resid)) +
  geom_point(shape = 21, size = 4, alpha = 0.6, fill = 'steelblue4') +
  geom_hline(aes(yintercept = 0)) +
  xlab('Predicted') +
  ylab('Residuals')
hetsk_plot
Looks a little iffy: we’ve got some heteroskedasticity going on. Let’s try to see where it is. The great thing about broom is that it makes it really easy to manipulate the augmented data and plot diagnostics based on the original variables.
hetsk_plot2 <- auged_reg %>%
  ggplot(aes(.fitted, .resid, fill = spgroupname)) +
  geom_point(shape = 21, size = 4, alpha = 0.6) +
  geom_hline(aes(yintercept = 0)) +
  xlab('Predicted') +
  ylab('Residuals')
hetsk_plot2

So we see here that the culprits are the herrings and salmons. That tells us to be a little cautious about the predictive ability and estimated errors of this model, and maybe we need to do a better job of clustering our errors. Let’s look at things another way. We saw from the coefficient plot that the region effects are the most significant in the model. How confident are we in those?
regional_bias <- auged_reg %>% # check residuals by group
  ggplot(aes(regionfao, .resid)) +
  geom_boxplot(fill = 'steelblue4') +
  geom_hline(aes(yintercept = 0)) +
  xlab('FAO Region') +
  ylab('Residuals')
regional_bias

species_bias <- auged_reg %>%
  ggplot(aes(spgroupname, .resid)) +
  geom_boxplot(fill = 'steelblue4') +
  geom_hline(aes(yintercept = 0)) +
  xlab('Species Category') +
  ylab('Residuals') +
  coord_flip()
species_bias

All in all, then, we’ve got some heteroskedasticity that makes us a little suspicious of our standard errors, but no major biases in our estimation. Our life choice model works! Let’s move to China and fish for whatever; the model says it doesn’t matter.
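If we wanted to act on that suspicion about the standard errors, one common fix is to cluster the errors by group, e.g. with the sandwich and lmtest packages (an option not used in the original analysis). A minimal sketch on the built-in mtcars data, clustering by cylinder count; for our model we would cluster fish_model by spgroupname instead:

```r
# Sketch of cluster-robust standard errors (assumes the sandwich and
# lmtest packages are installed). Shown on mtcars for self-containedness.
library(sandwich)
library(lmtest)

m <- lm(mpg ~ wt, data = mtcars)
coeftest(m, vcov = vcovCL(m, cluster = ~ cyl)) # errors clustered by cylinder group
```

The coefficient estimates are unchanged; only the standard errors (and hence the p-values) are adjusted for within-cluster correlation.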
A quick warning on speed
So far, we’ve been preaching the dplyr gospel pretty hard. All in all, it makes code faster, more efficient, and much easier to read. But there are times when it’s best to keep it simple, especially when speed is critical. This is less dplyr’s fault than an issue with data frames themselves.
We are going to compare two functions that do the same thing: one using dplyr and data frames, and one using more basic R functions. The goal is a function that calculates the mean catch-series length in an FAO region.
dplyr_fun <- function(region, dat) {
  dat %>%
    filter(regionfao == region) %>%
    summarise(mean_length = mean(length_catch))
}

basic_fun <- function(region, dat) {
  mean(as.numeric(dat[dat[, 'regionfao'] == region, 'length_catch']))
}
regions <- rep(unique(as.character(regdat$regionfao)), 100) # each region, repeated 100 times

start_time <- proc.time() # time the dplyr version
a <- lapply(regions, dplyr_fun, dat = regdat)
t1 <- proc.time() - start_time

start_time <- proc.time() # time the basic version
b <- lapply(regions, basic_fun, dat = as.matrix(regdat))
t2 <- proc.time() - start_time

t1[1] / t2[1]
## user.self
## 11.76132
all(plyr::ldply(a)$V1 == plyr::ldply(b)$V1) # check that the two give the same result
## [1] TRUE
The dplyr version of the function takes over ten times as long as the same function in basic notation! A difference of a few seconds doesn’t matter much in most cases, but if you’re doing huge numbers of simulations, say in an MCMC, it starts to add up. It can be the difference between a model running in a day and running in a few hours.
This slowdown doesn’t always hold: dplyr will often be faster than bunches of nested loops. But when speed is a priority, it’s worth checking whether using matrices instead of data frames and dplyr will save you some serious time.
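For checks like this, one-off proc.time() differences are noisy; the microbenchmark package (an assumption: it isn’t used above, and must be installed) repeats each expression many times and reports the full timing distribution. A sketch on made-up data:

```r
# Compare a dplyr pipeline against base subsetting on invented data.
# microbenchmark() runs each expression `times` times, interleaved,
# which is far less noisy than a single proc.time() difference.
library(dplyr)
library(microbenchmark)

df <- data.frame(x = runif(1e4), g = sample(letters[1:5], 1e4, replace = TRUE))

microbenchmark(
  dplyr = df %>% filter(g == 'a') %>% summarise(m = mean(x)),
  base  = mean(df$x[df$g == 'a']),
  times = 50
)
```

The printed summary gives min, median, and max timings per expression, so you can see whether a speed difference is consistent or just run-to-run noise.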